test: GB10 CUDA graph repro harness (T1.3, T1.5, T1.6) #95
Merged
…agged)

Adds TestCUDAGraph_MultiTensorUpload_GB10 behind //go:build dgxgb10 so CI never runs it. The Spark DGX pod (T1.4, next wave) will pass the tag to reproduce the hang on real GB10 hardware.

The test uploads 50 float32 tensors (including a 256x1024 matrix), begins capture, runs a MatMul inside the capture region, and calls EndCapture. All three possible outcomes are observable:
- EndCapture returns cleanly: E2 fix is in place (test passes).
- ErrCaptureIncompatibleAllocation bubbles out: T1.2 probe caught the unsafe allocation synchronously (test passes).
- Capture body does not complete in 30s: hang is live, test fails via context.WithTimeout + t.Fatal.

Only compute/gpu_engine_gb10_test.go is added; no non-test files are touched.
… sentinel wrapping)

Adds CPU-mock tests that close the Wave 1 gaps on the capture guard without requiring CUDA hardware:
- ensureNotCapturing over all three CaptureStatus values (table-driven), the nil-Ptr branch, and probe-error propagation.
- allocWeight and uploadBytes propagate the ErrCaptureIncompatibleAllocation sentinel and the wrapped probe error unchanged.
- ErrCaptureIncompatibleAllocation survives fmt.Errorf %w wrapping.
- cuda.StreamFromPtr(nil).Ptr() round-trips, and StreamCaptureStatus tolerates a zero handle when the runtime is unavailable.

To enable probe-error and status-branch tests without CUDA, introduces a single-line indirection in compute/gpu_engine.go:

var captureStatusFn = cuda.StreamCaptureStatus

Tests swap it via swapCaptureStatusFn (test-only helper). Zero stub markers in production code; test fakes confined to *_test.go files.

Verifies: [infrastructure]
- gofmt -s -w on compute/gpu_engine_gb10_test.go (trailing newline)
- gofmt -s -w on internal/cuda/purego.go (field-alignment delta from cudaStreamGetCaptureInfo addition in T1.1)
- Mark T1.3/T1.5/T1.6 complete in docs/plan.md

Wave 2 of ztensor E1 closes out: hardware repro test (T1.3), CPU-mock coverage (T1.5), format/lint sweep (T1.6). Next: Wave 3 (T1.4 -- Spark submission of the dgxgb10 test for evidence capture), then Wave 4 (E2 fix work).
Summary
Wave 2 of the GB10 CUDA graph capture hang fix (docs/plan.md E1). Builds out the reproduction harness on top of Wave 1's probe primitives (#94).
- compute/gpu_engine_gb10_test.go (185 lines), gated behind //go:build dgxgb10. Uploads 50 float32 tensors (incl. a 256×1024 matrix), begins capture, runs one MatMul, and ends capture, all guarded by context.WithTimeout(30s). Accepts three outcomes: clean capture, ErrCaptureIncompatibleAllocation, or t.Fatal on hang.
- compute/capture_guard_test.go, compute/gpu_engine_alloc_guard_test.go, internal/cuda/runtime_purego_test.go. Closes coverage gaps:
  - ensureNotCapturing over all three CaptureStatus values (table-driven).
  - ensureNotCapturing probe-error propagation (does NOT masquerade as ErrCaptureIncompatibleAllocation).
  - ensureNotCapturing nil-Ptr branch.
  - allocWeight + uploadBytes each propagate the sentinel and the probe error.
  - ErrCaptureIncompatibleAllocation survives fmt.Errorf("%w", ...) wrapping.
  - cuda.StreamFromPtr(nil).Ptr() round-trip.
  - cuda.StreamCaptureStatus tolerates a zero-handle stream when the runtime is unavailable.
- var captureStatusFn = cuda.StreamCaptureStatus in compute/gpu_engine.go so tests can swap the probe. Call site cuda.StreamCaptureStatus(s) → captureStatusFn(s).
- gofmt -s -w on the two E1 files that drifted (compute/gpu_engine_gb10_test.go trailing newline, internal/cuda/purego.go field alignment after cudaStreamGetCaptureInfo was added). golangci-lint delta on ./compute/... ./internal/cuda/... is zero (13 pre-existing issues in unrelated files).

Hardware run (T1.4: submit the dgxgb10 test via a Spark manifest) stays in Wave 3 and is not part of this PR.
Verification report
- wave-2-task-T1.3 then wave-2-task-T1.5 via --no-ff. Silent-revert check: every non-context line from each M1 patch reflected on the integration branch.
- go build ./... PASS.
- go test ./compute/... ./internal/cuda/... -race -timeout 120s PASS.
- go vet ./... → 28 warnings, identical to origin/main baseline (no delta).
- golangci-lint run ./compute/... ./internal/cuda/... → 13 pre-existing issues, 0 in E1 files, 0 new.
- gofmt -s -l / goimports -l clean across all E1 files after T1.6 sweep.
- TestCUDAGraph_MultiTensorUpload_GB10; infrastructure via 10 new CPU-mock tests.

Files touched
- compute/gpu_engine.go (+3 −1): one-line captureStatusFn indirection for T1.5 testability
- compute/capture_guard_test.go (+120): extended guard coverage
- compute/gpu_engine_alloc_guard_test.go (+113, new): allocWeight/uploadBytes propagation tests
- compute/gpu_engine_gb10_test.go (+184, new, build-tagged): hardware repro
- internal/cuda/runtime_purego_test.go (+35): binding-level gap tests
- internal/cuda/purego.go (±13, field alignment from T1.1 addition)
- docs/plan.md: mark T1.3/T1.5/T1.6 complete

Test plan
- go build ./...
- go test ./compute/... ./internal/cuda/... -race -timeout 120s
- go vet ./... (delta vs origin/main = 0)
- gofmt -s -l / goimports -l on E1 files (clean)
- golangci-lint run ./compute/... ./internal/cuda/... (0 new findings)
- cuda-graph-gb10-repro.yaml to Spark; attach log evidence